Subsystems that are coupled due to dynamics and costs arise naturally in various communication
applications. In many such applications the control actions are shared between different control stations,
giving rise to a control-sharing information structure. Previous studies of control sharing have
concentrated on the linear quadratic Gaussian setup and a solution approach tailored to continuous-valued
control actions. In this paper a three-step solution approach for finite-valued control actions is presented.
In the first step, a person-by-person approach is used to identify redundant data or a sufficient statistic
for local information at each control station. In the second step, the common-information based approach
of Nayyar et al. (2011) is used to find a sufficient statistic for the common information shared between
control stations and to obtain a dynamic programming decomposition. In the third step, the specifics
of the model are used to simplify the sufficient statistic and the dynamic program. As an example, an
exact solution of a two-user multiple access broadcast system is presented.

I. Introduction 

A. Motivation 

In this paper, we investigate a modular architecture for networked control systems that consists
of a collection of dynamically coupled subsystems, each with a local control station. Each control
station observes, either fully or partially, the state of its subsystem, but does not observe the state
of other subsystems.¹ In addition, each control station observes the control actions of all other
control stations with one-step delay. Such control sharing happens naturally in applications
like multi-access broadcast [1], [2] (see Section VI), paging and registration in mobile cellular
systems [3], real-time communication with feedback [4], and sensor networks [5].

¹ In the formal description of the model in Section II-A, we also assume that, in addition to the local observations, the control
stations observe a shared state. With this slight generalization, the model can capture more applications, but it does not add
any conceptual difficulties. For that reason, we do not include the shared state in the discussion in this section.

August 9, 2012 DRAFT

Each control station affects the state evolution of each subsystem; thus the subsystems have
coupled dynamics. The per-step cost depends on the state of all subsystems and the control
actions of all control stations; thus the control stations are coupled through cost. No control
station knows the information available to other control stations. Hence, the system has a
non-classical information structure [6], [7].

Each control station has perfect recall, that is, it chooses a control action based on the history
of its observations and control actions. Since the domain of the control laws increases with time,
we need to find a time-homogeneous sufficient statistic for the past data at each controller to
pose and solve the infinite horizon optimal control problem. Finding such sufficient statistics is
difficult due to the non-classical nature of information. For systems with classical information,
a sufficient statistic at a control station captures the effect of past data (at that control station)
on future estimation (at that control station). This feature is called the dual effect of control.
For systems with non-classical information, in addition to the above, a sufficient statistic at a
control station must capture the effect of past control actions (at that control station) on the
future estimation at other control stations. This feature is called the triple effect of control,
the third effect being the signaling effect. The control-sharing information structure makes the
signaling effect explicit; as such, solution techniques for control sharing provide insights into other
non-classical information structures where the signaling effect is implicit.

B. Literature Overview 

There are only a few general frameworks of dynamic programming for systems with
non-classical information structure: the sequential team approach for finite horizon systems [8], a
common-information based approach for finite horizon systems [9], and a two-step solution 
approach for two-agent finite and infinite horizon systems [10]. We are interested in a solution 
framework that works for multiple control stations and extends to infinite horizon systems, and 
hence, these generic dynamic programming approaches are not applicable. 

Most of the research on non-classical information structures has focused on specific system
dynamics and/or specific information structures. We briefly describe some of these approaches
below (see [11] for a detailed discussion).

The special case of linear dynamics and Gaussian disturbances with specific information
structures has received considerable attention in the literature. Examples include static teams [12],
[13], partially nested teams [14], [15], stochastically nested teams [16], and quadratically invariant
teams [17]-[19]. We are interested in systems with non-linear dynamics. More importantly, the
control-sharing information structure is neither static, nor partially nested, nor stochastically
nested, nor quadratically invariant.

The special case of non-classical information structure with specific data sharing between 
the control stations has also received considerable attention in the literature. Examples include 
delayed-state observation [20], delayed (observation) sharing [6], [21], [22], control sharing [23], 
[24], periodic sharing [25], belief sharing [16], and partial history sharing [26]. Out of these,
the models closest to our setup are control sharing and partial history sharing.

As described earlier, in a control-sharing information structure, control stations can directly
signal to one another through their control actions. This signaling aspect was exploited in [23],
[24] by explicitly embedding the local observations in the control actions with an arbitrarily small
perturbation of the control action. Their embedding technique relies on two facts: (i) real-valued random
variables carry infinite information (in an information theoretic sense); and (ii) measurable
bijections exist between Euclidean spaces. Such an embedding of observations converts
the control-sharing information structure to a one-step delayed (observation) sharing information
structure, which is also a partially nested information structure. Then, the solution techniques
for partially nested teams give an approximate solution for the control-sharing information
structure [24]. However, our motivation for investigating these models comes from communication
networks, most of which have finite-valued control actions.² Embedding observations in
finite-valued control actions is not possible. Hence, the solution technique of [23], [24] does not work
for finite-valued action spaces.

In a system with control sharing, each control station knows part of the history of data at all
control stations. Thus, a control-sharing information structure is also a partial history sharing
information structure, for which the following solution approach is proposed in [26]. Split the
data available at each control station into two parts: a common information part that is commonly
shared amongst all control stations, and a local information part that consists of the remaining
data. Then, the decentralized stochastic control problem is equivalent to a centralized
stochastic control problem in which a fictitious coordinator observes the common information
and chooses functions that map the local information at each control station to its action. This
solution approach extends to infinite horizons only when the local information is not increasing
with time, which is not the case in the above model because the local information, which is the
history of local state observations, is increasing with time. Hence, the solution approach of [26]
is not directly applicable to control-sharing information structures.

² Even otherwise, the assumption of noiseless sharing of continuous-valued control actions is not realistic in communication
applications because it requires infinite capacity communication channels.

C. Contributions of the paper 

One of the main difficulties in obtaining a dynamic programming decomposition for decentralized
stochastic control is to identify sufficient statistics (for each control station) that summarize
the effect of the history of their observations and actions on future observations and cost. In this
paper, we present a three-step approach to find such sufficient statistics for decentralized control
of dynamically coupled subsystems with control sharing.

In the first step, we use a person-by-person approach and identify either irrelevant data or
a sufficient statistic for part of the data at each control station. In the second step, we use the
common information approach of [26] and identify a sufficient statistic for the common information
at all control stations. In the third step, we use the salient features of the model (full or partial
observation of local states, dynamic coupling using control actions, and sharing of control
actions) to simplify the sufficient statistic obtained in the second step. Using the sufficient
statistics of the second step (and their simplification in the third step), we obtain a dynamic
programming decomposition which can be extended to the infinite horizon discounted cost setup.
Such a dynamic programming decomposition is not possible by using either the person-by-person
approach or the common-information approach alone.

We use the proposed solution approach to obtain a dynamic programming decomposition for 
a multiuser broadcast channel, and analytically solve the dynamic program when both users have 
the same arrival rates. Although this example is very well studied, this is the first result that 
provides a dynamic programming decomposition for this model. 

The rest of this paper is organized as follows. We present two models for coupled subsystems
with control sharing in Section II: the full and partial observation models. We present the
three-step approach described above for the full observation model in Section III and for the partial
observation model in Section IV. We present an example of a two-user multiaccess broadcast
channel in Section VI and conclude in Section VII.

D. Notation 

Random variables are denoted with upper case letters (X, Y, etc.), their realizations with lower
case letters (x, y, etc.), and their spaces of realizations by script letters ($\mathcal{X}$, $\mathcal{Y}$, etc.). Subscripts
denote time and superscripts denote the subsystem; e.g., $X^i_t$ denotes the state of subsystem i
at time t. The shorthand notation $X^i_{1:t}$ denotes the vector $(X^i_1, X^i_2, \dots, X^i_t)$. Boldface letters
denote the collection of variables at all subsystems; e.g., $\mathbf{X}_t$ denotes $(X^1_t, X^2_t, \dots, X^n_t)$. The
notation $\mathbf{X}^{-i}_t$ denotes the vector $(X^1_t, \dots, X^{i-1}_t, X^{i+1}_t, \dots, X^n_t)$.

$\Delta(\mathcal{X})$ denotes the probability simplex on the space $\mathcal{X}$. $\mathbb{P}(A)$ denotes the probability of an
event A, and $\mathbb{E}[X]$ denotes the expectation of a random variable X. $\mathbb{1}[x = y]$ denotes the
indicator function of the statement x = y, i.e., $\mathbb{1}[x = y] = 1$ if x = y and 0 otherwise. Let $\mathbb{N}$
denote the set of natural numbers and $\mathbb{Z}_+$ denote the set of non-negative integers.

II. Coupled subsystems with control sharing 
A. Model and Problem Formulation 

System components: Consider a discrete-time networked control system with n subsystems.
The state $(Z_t, X^i_t)$ of subsystem i, i = 1, ..., n, has two components: a local state $X^i_t \in \mathcal{X}^i$
and a shared state $Z_t \in \mathcal{Z}$, which is identical for all subsystems. The initial shared state $Z_1$
has a distribution $P_Z$. Conditioned on the initial shared state $Z_1$, the initial local states of all
subsystems are independent; the initial local state $X^i_1$ is distributed according to $P_{X^i|Z}$, i = 1, ..., n.
Let $\mathbf{X}_t := (X^1_t, \dots, X^n_t)$ denote the local state of all subsystems.

A control station is co-located with each subsystem. Let $U^i_t \in \mathcal{U}^i$ denote the control action of
control station i and $\mathbf{U}_t := (U^1_t, U^2_t, \dots, U^n_t)$ denote the collection of all control actions.

System dynamics: The shared and the local states of the subsystems are coupled through the
control actions; the shared state evolves according to

$$Z_{t+1} = f^0_t(Z_t, \mathbf{U}_t, W^0_t) \tag{1}$$

while the local state of subsystem i, i = 1, ..., n, evolves according to

$$X^i_{t+1} = f^i_t(Z_t, X^i_t, \mathbf{U}_t, W^i_t) \tag{2}$$

where $W^i_t \in \mathcal{W}^i$, i = 0, 1, ..., n, is the plant disturbance with distribution $P_{W^i}$. The processes
$\{W^i_t, t = 1, \dots\}$, i = 0, 1, ..., n, are assumed to be independent across time, independent of
each other, and also independent of the initial state $(Z_1, \mathbf{X}_1)$ of the system.

Note that the updated local state of subsystem i depends only on the previous local state of 
subsystem i and previous shared state but is controlled by all control stations. 

Observation models and information structures: We consider two observation models that
differ in the observation of the local state $X^i_t$ at control station i. In the first model, called the
full observation model, control station i perfectly observes the local state $X^i_t$; in the second
model, called the partial observation model, control station i observes a noisy version $Y^i_t \in \mathcal{Y}^i$ of
the local state $X^i_t$ given by

$$Y^i_t = h^i_t(X^i_t, N^i_t) \tag{3}$$

where $N^i_t \in \mathcal{N}^i$ is the observation noise with distribution $P_{N^i}$. The processes $\{N^i_t, t = 1, \dots\}$,
i = 1, ..., n, are assumed to be independent across time, independent of each other, independent
of $\{W^i_t, t = 1, \dots\}$, i = 0, 1, ..., n, and independent of the initial states $(\mathbf{X}_1, Z_1)$.

In both models, in addition to the local measurements of the state of its subsystem, each
control station perfectly observes the shared state $Z_t$ and the one-step delayed control actions
$\mathbf{U}_{t-1}$ of all control stations. The control stations perfectly recall all the data they observe. Thus,
in the full observation model, control station i chooses a control action according to

$$U^i_t = g^i_t(Z_{1:t}, X^i_{1:t}, \mathbf{U}_{1:t-1}) \tag{4}$$

while in the partial observation model, it chooses a control action according to

$$U^i_t = g^i_t(Z_{1:t}, Y^i_{1:t}, \mathbf{U}_{1:t-1}). \tag{5}$$

The function $g^i_t$ is called the control law of control station i. The collection $\mathbf{g}^i := (g^i_1, g^i_2, \dots, g^i_T)$
of control laws at control station i is called the control strategy of control station i. The collection
$\mathbf{g} := (\mathbf{g}^1, \mathbf{g}^2, \dots, \mathbf{g}^n)$ of control strategies of all control stations is called the control strategy of
the system.

Cost and performance: At time t, the system incurs a cost $c_t(Z_t, \mathbf{X}_t, \mathbf{U}_t)$ that depends on the
shared state, the local state of all subsystems, and the actions of all control stations. Thus, the
subsystems are also coupled through cost.

The system runs for a time horizon T. The performance of a control strategy $\mathbf{g}$ is measured
by the expected total cost incurred by that strategy, which is given by

$$\mathbb{E}^{\mathbf{g}}\Big[\sum_{t=1}^{T} c_t(Z_t, \mathbf{X}_t, \mathbf{U}_t)\Big] \tag{6}$$

where the expectation is with respect to a joint measure of $(Z_{1:T}, \mathbf{X}_{1:T}, \mathbf{U}_{1:T})$ induced by the
choice of the control strategy $\mathbf{g}$.

We are interested in the following optimal control problem:

Problem 1: Given the distributions $P_Z$, $P_{X^i|Z}$, $P_{W^i}$, $P_{N^i}$ of the initial shared state, initial
local state, plant disturbance of subsystem i, and observation noise of subsystem i (for the
partial observation model), i = 1, ..., n, a horizon T, and the cost functions $c_t$, t = 1, ..., T,
find a control strategy $\mathbf{g}$ that minimizes the expected total cost given by (6).

B. Applications in communication networks 

The control-sharing information structure arises naturally in communication networks, as is
illustrated by some applications described below.

1) Paging and registration in cellular networks: Consider a mobile cellular network consisting
of two controllers: a network operator and a mobile station. The local state $X^1_t$ of the network
operator is a constant and the local state $X^2_t$ of the mobile station is its current location, which
changes in a Markovian manner. We will describe the shared state $Z_t$ later. The control action
$U^1_t$ of the network operator is a permutation of $\mathcal{X}^2$, the set of all possible locations of the mobile,
and denotes the order in which the mobile station will be searched if there is a paging request. The
control action $U^2_t$ of the mobile station is either $X^2_t$ (indicating that the mobile station registers
with the network) or NR (indicating that the mobile station does not register).

At each time, the network may get an exogenous paging request to seek the location of the
mobile station. If a paging request is received (denoted by $P_t = 1$), the cost of searching is given
by the index $i(X^2_t, U^1_t)$ of $X^2_t$ in $U^1_t$. If no paging request is received (denoted by $P_t = 0$) and
the mobile station registers with the network, a registration cost of r is incurred. The process
$P_t$ is a binary-valued Markov process. If either the mobile station is paged or the mobile station
registers with the network, the network operator learns the current location of the mobile station.
Let $M_t$ denote the time since the last paging or registration request and $S_t$ denote the location
of the mobile station at that time. Then $Z_t = (P_t, M_{t-1}, S_{t-1})$ is the shared state of the system.

The above model corresponds to the model of paging and registration in mobile cellular
networks considered in [3]. The control action $U^1_t$ of the network is based on information known
to the mobile station, hence $U^1_t$ is effectively observed at the mobile station. The control action
$U^2_t$ of the mobile station is communicated to the network operator. Hence, this system has a
control-sharing information structure.

2) Real-time communication: Consider a real-time communication system consisting of an
encoder and a decoder. The encoder observes a first-order Markov source $S_t$. The local state
$X^1_t$ of the encoder is $(S_{t-1}, S_t)$ and the local state $X^2_t$ of the decoder is a constant. The shared
state $Z_t$ is also a constant. The control action $U^1_t$ of the encoder is a quantization symbol that is
communicated to the decoder. The control action $U^2_t$ of the decoder is an estimate of the one-step
delayed source $S_{t-1}$ of the encoder. The cost at each time is given by a distortion between $S_{t-1}$
and $U^2_t$.

The above model corresponds to the model of real-time communication considered in [27]
(specialized to infinite memory). The control action $U^1_t$ of the encoder is communicated to the
decoder. The control action $U^2_t$ of the decoder is based on the information known to the encoder,
hence $U^2_t$ is effectively observed at the encoder. Hence, this system has the full observation model
considered above with shared state $Z_t = 0$.

3) Multiaccess broadcast: Consider a two-user multiaccess broadcast system. At time t, $W^i_t \in \{0, 1\}$
packets arrive at each user according to independent Bernoulli processes with $P(W^i_t = 1) = p^i$,
i = 1, 2. Each user may store only $X^i_t \in \{0, 1\}$ packets in a buffer. If a packet arrives
when the user's buffer is full, the packet is dropped.

Both users may transmit $U^i_t \in \{0, 1\}$ packets over a shared broadcast medium. A user can
transmit only if it has a packet, thus $U^i_t \le X^i_t$. If only one user transmits at a time, the
transmission is successful and the transmitted packet is removed from the queue. If both users
transmit simultaneously, packets "collide" and remain in the queue. Thus, the state update for
user 1 is given by $X^1_{t+1} = \min(X^1_t - U^1_t (1 - U^2_t) + W^1_t, 1)$. The state update rule for user 2
is symmetric.

Instead of costs, it is more natural to work with rewards in this example. The objective is
to maximize throughput, or the number of successful packet transmissions. Thus, the per-step
reward is $c(\mathbf{x}, \mathbf{u}) = u^1 \oplus u^2$, where $\oplus$ denotes binary XOR.

When the arrival rates at both users are the same ($p^1 = p^2$), the above model corresponds to the
two-user multiaccess broadcast system considered in [1], [2], [28]. Slight variations of the above
model were considered in [29], [30]. In recent years, the two-user multiaccess broadcast system
with asymmetric arrivals ($p^1 \neq p^2$) has been used as a benchmark problem for decentralized
stochastic control problems in the artificial intelligence community [31]-[35].

Due to the broadcast nature of the communication channel, each user observes the transmission
decision of the other user. Hence the system has the full observation model considered above
with shared state $Z_t = 0$. We will revisit this model in Section VI.
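As a minimal sketch of the dynamics just described, the following Python function performs one step of the two-user system. It assumes the buffer update $X^i_{t+1} = \min(X^i_t - U^i_t(1 - U^j_t) + W^i_t, 1)$ implied by the description above (a successful transmission empties the buffer, and the buffer holds at most one packet). The function name `step` and the tuple encoding of states, actions, and arrivals are illustrative choices, not part of the original model description.

```python
def step(x, u, w):
    """One time step of the two-user multiaccess broadcast system.

    x, u, w are pairs of 0/1 values: buffer states, transmission
    decisions, and packet arrivals of users 1 and 2 (illustrative encoding).
    Returns the next buffer states and the per-step reward.
    """
    # A user can transmit only if it has a packet: U^i <= X^i.
    if any(ui > xi for ui, xi in zip(u, x)):
        raise ValueError("a user cannot transmit from an empty buffer")

    # Reward is 1 exactly when one user transmits alone (binary XOR).
    reward = u[0] ^ u[1]

    x_next = []
    for i in (0, 1):
        j = 1 - i
        success = u[i] * (1 - u[j])          # transmitted without collision
        x_next.append(min(x[i] - success + w[i], 1))  # buffer holds one packet
    return tuple(x_next), reward
```

For instance, `step((1, 1), (1, 0), (0, 0))` models user 1 transmitting alone with no new arrivals, while `step((1, 1), (1, 1), (1, 0))` models a collision in which both packets stay queued and the arriving packet at user 1 is dropped.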

III. Main result for the full observation model 

In this section, we derive the structure of optimal control laws and a dynamic programming
decomposition for the full observation model. As stated in the introduction, the full observation
model has a partial history sharing information structure [26]. Nayyar et al. proposed a common
information based approach to design systems with partial information sharing. According to
their approach, the design of optimal control strategies is investigated from the point of view of
a coordinator that observes the shared common information. In the full observation model, the
shared common information is $C_t = (Z_{1:t}, \mathbf{U}_{1:t-1})$, and the private local information is $L^i_t = X^i_{1:t}$.
According to [26], the posterior probability $P(Z_t, \mathbf{L}_t \mid C_t)$ is a sufficient statistic for the shared
common information $C_t$.

However, directly using the above approach is not useful for the full observation model because
the local information $L^i_t$ at control station i, i = 1, ..., n, is increasing with time, which causes
the dimension of the sufficient statistic $P(Z_t, \mathbf{L}_t \mid C_t)$ to increase with time; therefore,
$P(Z_t, \mathbf{L}_t \mid C_t)$ does not work as a sufficient statistic for the infinite horizon setup.

In this paper, we present the following three-step approach to simplify the structure of the
control laws and derive a dynamic programming decomposition (that extends to the infinite
horizon setup).

1) Use a person-by-person approach to show that the past values of the local state $X^i_{1:t-1}$ are
irrelevant at control station i at time t. Thus, for any control strategy of control station i that
uses $(X^i_{1:t}, Z_{1:t}, \mathbf{U}_{1:t-1})$, we can choose a control strategy that uses only $(X^i_t, Z_{1:t}, \mathbf{U}_{1:t-1})$
without any loss in performance.

2) When attention is restricted to control strategies of the form derived in Step 1, the local
information $L^i_t = X^i_t$ at control station i does not increase with time. Thus, using the
results of [26], we can show that $\Pi_t = P(\mathbf{X}_t, Z_t \mid C_t)$ is a sufficient statistic for the
common information $C_t$ and is also an information state for dynamic programming.

3) Using the system dynamics, show that $\Pi_t$ defined in Step 2 is equivalent to $(Z_t, \Theta_t)$, where
$\Theta_t = (\Theta^1_t, \dots, \Theta^n_t)$ and $\Theta^i_t = P(X^i_t \mid C_t)$. Using this equivalence, we can simplify the
structural result and dynamic programming decomposition of Step 2.

Now, we describe each of these steps in detail. For simplicity of exposition, we assume that $\mathcal{Z}$,
$\mathcal{X}^i$, $\mathcal{U}^i$, and $\mathcal{W}^i$, i = 1, ..., n, are finite. The results extend to general alphabets under suitable
technical conditions (similar to those for centralized stochastic control [36]).

Step 1: Shedding of irrelevant information

In this section, we show that the past values of the local state $X^i_{1:t-1}$ are irrelevant at control
station i at time t, i = 1, ..., n. In particular:

Proposition 1: In the full observation model, restricting attention to control laws of the form

$$U^i_t = g^i_t(X^i_t, Z_{1:t}, \mathbf{U}_{1:t-1}) \tag{7}$$

at all control stations i, i = 1, ..., n, is without loss of optimality.

A priori, it is not obvious that the past data is irrelevant. Suppose we pick any control station i,
i = 1, ..., n; arbitrarily fix the control strategy of all control stations except station i; and consider
the subproblem of finding the optimal control strategy at control station i. In principle, the history
$X^i_{1:t-1}$ of local states at control station i may give some information about the history $X^j_{1:t-1}$
of local states at control station j, j ≠ i, and hence may help in predicting the future control
actions of control station j. The following proposition shows that this is not the case. Conditioned
on the shared observations $(Z_{1:t}, \mathbf{U}_{1:t})$, the local state processes $\{X^i_t, t = 1, \dots\}$, i = 1, ..., n,
evolve independently.

Proposition 2: In the full observation model, the local states of all subsystems are conditionally
independent given the history of shared state and control actions. Specifically, for any
realization $z_t \in \mathcal{Z}$, $x^i_t \in \mathcal{X}^i$ and $u^i_t \in \mathcal{U}^i$ of $Z_t$, $X^i_t$ and $U^i_t$, i = 1, ..., n, t = 1, ..., T, we have

$$P(\mathbf{X}_{1:t} = \mathbf{x}_{1:t} \mid Z_{1:t} = z_{1:t}, \mathbf{U}_{1:t} = \mathbf{u}_{1:t}) = \prod_{i=1}^{n} P(X^i_{1:t} = x^i_{1:t} \mid Z_{1:t} = z_{1:t}, \mathbf{U}_{1:t} = \mathbf{u}_{1:t}) \tag{8}$$

See Appendix A for proof. One immediate consequence of the above proposition is the following:

Lemma 3: Consider the full observation model for an arbitrary but fixed choice of control
strategy $\mathbf{g}$. Define $R^i_t = (X^i_t, Z_{1:t}, \mathbf{U}_{1:t-1})$. Then,

1) The process $\{R^i_t, t = 1, \dots, T\}$ is a controlled Markov process with control action $U^i_t$, i.e.,
for any $x^i_t, \tilde{x}^i_t \in \mathcal{X}^i$, $z_t, \tilde{z}_t \in \mathcal{Z}$, $u^i_t, \tilde{u}^i_t \in \mathcal{U}^i$, $r^i_t = (x^i_t, z_{1:t}, \mathbf{u}_{1:t-1})$, $\tilde{r}^i_t = (\tilde{x}^i_t, \tilde{z}_{1:t}, \tilde{\mathbf{u}}_{1:t-1})$,
i = 1, ..., n, and t = 1, ..., T,

$$P(R^i_{t+1} = \tilde{r}^i_{t+1} \mid R^i_{1:t} = r^i_{1:t}, U^i_{1:t} = u^i_{1:t}) = P(R^i_{t+1} = \tilde{r}^i_{t+1} \mid R^i_t = r^i_t, U^i_t = u^i_t)$$

2) The instantaneous conditional cost simplifies as follows:

$$E[c_t(Z_t, \mathbf{X}_t, \mathbf{U}_t) \mid R^i_{1:t} = r^i_{1:t}, U^i_{1:t} = u^i_{1:t}] = E[c_t(Z_t, \mathbf{X}_t, \mathbf{U}_t) \mid R^i_t = r^i_t, U^i_t = u^i_t]$$

See Appendix B for proof.

In light of Lemma 3, let us reconsider the subproblem of finding the optimal control strategy
for control station i when the control strategy $\mathbf{g}^{-i}$ of all other control stations is fixed arbitrarily.
In this subproblem, control station i has access to $R^i_{1:t}$, chooses $U^i_t$, and incurs an expected
instantaneous cost $E[c_t(Z_t, \mathbf{X}_t, \mathbf{U}_t) \mid R^i_{1:t}, U^i_{1:t}]$. Lemma 3 implies that the optimal choice of the control
strategy $\mathbf{g}^i$ is a Markov decision process. Thus, using Markov decision theory [37], we get the
following (recall that $R^i_t = (X^i_t, Z_{1:t}, \mathbf{U}_{1:t-1})$ and the choice of $\mathbf{g}^{-i}$ is arbitrary):

Lemma 4: Consider the full observation model for any arbitrary but fixed choice of control
strategy $\mathbf{g}^{-i}$ of all control stations except i. Then, restricting attention to control laws of the
form

$$U^i_t = g^i_t(X^i_t, Z_{1:t}, \mathbf{U}_{1:t-1}) \tag{9}$$

at control station i is without loss of optimality.

Proof of Proposition 1: Lemma 4 implies that for an arbitrary choice of $\mathbf{g}^{-i}$, control
strategies of the form (9) at control station i dominate those of the form (4). Cyclically using
the same argument for all control stations proves the result. ■

Even after shedding $X^i_{1:t-1}$, the data at each control station is still increasing with time. In
the next step, we show how to "compress" this data into a sufficient statistic.

Step 2: Sufficient statistic for common data

Consider Problem 1 for the full observation model and restrict attention to control strategies of the form (7).
Proposition 1 shows that this restriction is without loss of optimality. We use the results of [26]
for this restricted setup.

Split the data at each control station into two parts: the common data $C_t = (Z_{1:t}, \mathbf{U}_{1:t-1})$ that
is observed by all control stations and the local (or private) data $L^i_t = X^i_t$ that is observed by only
control station i. Note that the common information $C_t \subseteq C_{t+1}$ is increasing with time, while the
local information $L^i_t$ has a fixed size. Thus, the system has a partial history sharing information
structure with finite local memory. Nayyar et al. [26] derived structural properties of optimal
controllers and a dynamic programming decomposition for such an information structure.

To present the result, we first define the following: 

Definition 1: Given any control strategy $\mathbf{g}$ of the form (7), let $\Pi_t$, t = 1, ..., T, denote the
posterior probability of $(Z_t, \mathbf{X}_t)$ given the common information $C_t$; i.e., for any $z \in \mathcal{Z}$ and
$\mathbf{x} = (x^1, \dots, x^n)$ with $x^i \in \mathcal{X}^i$, the component $(z, \mathbf{x})$ of $\Pi_t$ is given by

$$\Pi_t(z, \mathbf{x}) := P^{\mathbf{g}}(Z_t = z, \mathbf{X}_t = \mathbf{x} \mid C_t).$$

The update of $\Pi_t$ follows the standard non-linear filtering equation. It is shown in [26] that $\Pi_t$
is a sufficient statistic for $C_t$; in particular, we have the following structural result.

Proposition 5 ([26, Theorem 2] applied to the model of Proposition 1): In the full observation
model, restricting attention to control laws of the form

$$U^i_t = g^i_t(X^i_t, \Pi_t) \tag{10}$$

at all control stations i, i = 1, ..., n, is without loss of optimality.

To obtain a dynamic programming decomposition to find optimal control strategies of the
form (10), the following partially evaluated control laws were defined in [26]: For any control
strategy of the form (10), and any realization $\pi_t$ of $\Pi_t$, let

$$d^i_t(\cdot) = g^i_t(\cdot, \pi_t)$$

denote a mapping from $\mathcal{X}^i$ to $\mathcal{U}^i$. When $\Pi_t$ is a random variable, the above mapping is a random
mapping denoted by $D^i_t$. Let $\mathbf{d}_t = (d^1_t, \dots, d^n_t)$ and $\mathbf{D}_t = (D^1_t, \dots, D^n_t)$. Then optimal control
strategies of the form (10) are obtained as follows.

Proposition 6 ([26, Theorem 3] applied to the model of Proposition 1): For any
$\pi_t \in \Delta(\mathcal{Z} \times \mathcal{X}^1 \times \cdots \times \mathcal{X}^n)$, define

$$V_T(\pi_T) = \min_{\mathbf{d}_T} E[c_T(Z_T, \mathbf{X}_T, \mathbf{U}_T) \mid \Pi_T = \pi_T, \mathbf{D}_T = \mathbf{d}_T] \tag{11}$$

and for t = T-1, T-2, ..., 1,

$$V_t(\pi_t) = \min_{\mathbf{d}_t} E[c_t(Z_t, \mathbf{X}_t, \mathbf{U}_t) + V_{t+1}(\Pi_{t+1}) \mid \Pi_t = \pi_t, \mathbf{D}_t = \mathbf{d}_t] \tag{12}$$

Let $\Psi_t(\pi_t)$ denote the argmin of the right hand side of $V_t(\pi_t)$, and $\Psi^i_t$ denote the i-th component
of $\Psi_t$. Then, a control strategy

$$g^i_t(x^i_t, \pi_t) = \Psi^i_t(\pi_t)(x^i_t)$$

is optimal for Problem 1 with the full observation model.

Step 3: Simplification of the sufficient statistic 

In this step, we use Proposition 2 to simplify the sufficient statistic $\Pi_t$ used in Step 2, and
thereby simplify Propositions 5 and 6. To that end, we define the following.

Definition 2: Given any control strategy $\mathbf{g}$ of the form (10), let $\Theta^i_t$, t = 1, ..., T, denote
the posterior probability of $X^i_t$ given the common information $C_t$, i.e., for any $x^i \in \mathcal{X}^i$, the
component $x^i$ of $\Theta^i_t$ is given by

$$\Theta^i_t(x^i) := P^{\mathbf{g}}(X^i_t = x^i \mid C_t).$$

The update of $\Theta^i_t$ follows the standard non-linear filtering equation. For completeness, we
describe this update below.

Lemma 7: There exists a deterministic function $F_t$ such that

$$\Theta_{t+1} = F_t(\Theta_t, Z_t, \mathbf{U}_t, \mathbf{D}_t) \tag{13}$$

The proof follows from the law of total probability and Bayes rule. See Appendix C. 
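For intuition, the filtering update of Lemma 7 can be sketched for a single control station with finite alphabets. The sketch below is an illustration under stated assumptions, not the construction of Appendix C: the function name `update_belief`, the array encoding of the prescription `d`, and the matrix `trans` (the local-state transition kernel for the realized shared state and joint action) are all hypothetical. It first conditions the belief on the observed action through Bayes' rule, then propagates it through the local dynamics using the law of total probability.

```python
import numpy as np

def update_belief(theta, d, u_obs, trans):
    """Sketch of one belief update theta^i_t -> theta^i_{t+1}.

    theta : length-|X| array, theta[x] = P(X^i_t = x | C_t)
    d     : length-|X| integer array, d[x] = action prescribed in state x
    u_obs : action of station i observed by everyone at time t
    trans : |X| x |X| array, trans[x, y] = P(X^i_{t+1} = y | X^i_t = x),
            evaluated at the realized shared state and joint action
    """
    theta = np.asarray(theta, dtype=float)
    # Bayes' rule: keep only the local states consistent with the observed action.
    likelihood = (np.asarray(d) == u_obs).astype(float)
    posterior = theta * likelihood
    norm = posterior.sum()
    if norm == 0.0:
        raise ValueError("observed action has zero probability under theta and d")
    posterior /= norm
    # Law of total probability: push the posterior through the local dynamics.
    return posterior @ np.asarray(trans, dtype=float)
```

In the two-user multiaccess example with arrival probability 0.3, a user that is prescribed "transmit if the buffer is non-empty" (`d = [0, 1]`) and is observed to stay silent must have had an empty buffer, so its next belief is determined by the arrival process alone.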
We can now simplify the sufficient statistic $\Pi_t$ as follows:

Lemma 8: For any $z \in \mathcal{Z}$, $x^i \in \mathcal{X}^i$, i = 1, ..., n, the values $(Z_t, \Theta^1_t(x^1), \dots, \Theta^n_t(x^n))$ are sufficient to
compute $\Pi_t(z, \mathbf{x})$.

Proof: The proof follows directly from the definitions of $\Pi_t$ and $\Theta^i_t$ and from Proposition 2. Let
$C_t = (Z_{1:t}, \mathbf{U}_{1:t-1})$ and consider the component $(z, \mathbf{x})$ of $\Pi_t$:

$$\Pi_t(z, \mathbf{x}) \overset{(a)}{=} \mathbb{1}[Z_t = z] \cdot P(\mathbf{X}_t = \mathbf{x} \mid Z_{1:t}, \mathbf{U}_{1:t-1}) \overset{(b)}{=} \mathbb{1}[Z_t = z] \cdot \prod_{i=1}^{n} \Theta^i_t(x^i)$$

where (a) follows from the law of total probability and (b) follows from Proposition 2. ■
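The product form in the proof of Lemma 8 can be illustrated numerically. The following sketch rebuilds the joint belief from the realized shared state and the per-station marginals; the function name `joint_belief`, the integer encoding of the shared state, and the argument `num_z` are assumptions made for this illustration only.

```python
from functools import reduce
import numpy as np

def joint_belief(z_t, thetas, num_z):
    """Rebuild Pi_t(z, x^1, ..., x^n) = 1[z = z_t] * prod_i theta^i(x^i).

    z_t    : index of the realized shared state (0 <= z_t < num_z)
    thetas : list of 1-D arrays, thetas[i][x] = P(X^i_t = x | C_t)
    num_z  : size of the shared-state alphabet
    """
    marginals = [np.asarray(t, dtype=float) for t in thetas]
    # Outer product of the marginals gives the joint over (x^1, ..., x^n).
    product = reduce(np.multiply.outer, marginals)
    pi = np.zeros((num_z,) + product.shape)
    # All probability mass sits on the observed value of the shared state.
    pi[z_t] = product
    return pi
```

The storage saving is the point of Step 3: the joint belief has as many entries as the product of the alphabet sizes, whereas the marginals need only their sum.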
By substituting $(Z_t, \Theta_t)$ for $\Pi_t$ in Propositions 5 and 6, we get the following:

Theorem 1 (Structure of optimal controllers): In the full observation model, restricting attention
to control laws of the form

$$U^i_t = g^i_t(X^i_t, Z_t, \Theta_t) \tag{14}$$

at all control stations i, i = 1, ..., n, is without loss of optimality.

For any control strategy of the form (14), and any realization $\theta_t$ of $\Theta_t$, let

$$d^i_t(\cdot) = g^i_t(\cdot, z_t, \theta_t)$$

denote a mapping from $\mathcal{X}^i$ to $\mathcal{U}^i$. When $\Theta_t$ is a random variable, the above mapping is a random
mapping denoted by $D^i_t$. Let $\mathbf{d}_t = (d^1_t, \dots, d^n_t)$ and $\mathbf{D}_t = (D^1_t, \dots, D^n_t)$. Then optimal control
strategies of the form (14) are obtained as follows.

Theorem 2 (Dynamic programming decomposition): For any $z_t \in \mathcal{Z}$ and $\theta^i_t \in \Delta(\mathcal{X}^i)$, $i = 1, \dots, n$, define

$$V_T(z_T, \theta_T) = \min_{d_T} \mathbb{E}\big[c_T(Z_T, X_T, U_T) \mid Z_T = z_T, \Theta_T = \theta_T, D_T = d_T\big] \tag{15}$$

and for $t = T-1, T-2, \dots, 1$,

$$V_t(z_t, \theta_t) = \min_{d_t} \mathbb{E}\big[c_t(Z_t, X_t, U_t) + V_{t+1}(Z_{t+1}, \Theta_{t+1}) \mid Z_t = z_t, \Theta_t = \theta_t, D_t = d_t\big] \tag{16}$$

Let $\Psi_t(z_t, \theta_t)$ denote the argmin of the right hand side of $V_t(z_t, \theta_t)$, and $\Psi^i_t$ denote the $i$-th component of $\Psi_t$. Then, a control strategy

$$g^i_t(x^i_t, z_t, \theta_t) = \Psi^i_t(z_t, \theta_t)(x^i_t)$$

is optimal for Problem 1 with the full observation model.

IV. Main result for the partial observation model



In this section, we derive the structure of optimal control laws and a dynamic programming decomposition for the partial observation model. As in the full observation model, we cannot directly use the results of [26] because the local observations $Y^i_{1:t}$ at each control station are increasing with time. To circumvent this difficulty we follow a three step approach, similar to the one taken for the full observation model, and proceed as follows:

1) Use a person-by-person approach to show that $\Xi^i_t(x) := \mathbb{P}(X^i_t = x \mid Y^i_{1:t}, Z_{1:t}, U_{1:t-1})$ is a sufficient statistic for the history of local observations at control station $i$ at time $t$. Thus, for any control strategy of control station $i$ that uses $(Y^i_{1:t}, Z_{1:t}, U_{1:t-1})$, we can choose a strategy that uses only $(\Xi^i_t, Z_{1:t}, U_{1:t-1})$ without loss of optimality.

2) Steps 2 and 3 are similar to those of the full observation model with $X^i_t$ replaced by $\Xi^i_t$.

Now, we describe each of these steps in detail.

Step 1: Sufficient statistic for local observations 

In this step, we find a sufficient statistic for the local observations $Y^i_{1:t}$ at control station $i$. To that end, we define the following:

Definition 3: Given any control strategy $g$ of the form (5), let $\Xi^i_t$, $i = 1, \dots, n$, $t = 1, \dots, T$, denote the posterior probability of the local state $X^i_t$ of subsystem $i$ given all the information $(Y^i_{1:t}, Z_{1:t}, U_{1:t-1})$ at control station $i$, i.e., for any $x^i \in \mathcal{X}^i$, the component $x^i$ of $\Xi^i_t$ is given by

$$\Xi^i_t(x^i) := \mathbb{P}^g(X^i_t = x^i \mid Y^i_{1:t}, Z_{1:t}, U_{1:t-1}) \overset{(a)}{=} \mathbb{P}^g(X^i_t = x^i \mid Y^i_{1:t}, Z_{1:t-1}, U_{1:t-1})$$

where (a) follows from the independence of $\{W^0_t, t = 1, \dots, T\}$ from $\{W^i_t, t = 1, \dots, T\}$.

The update of $\Xi^i_t$ follows a non-linear filtering equation as shown below.

Lemma 9: For every $i$, $i = 1, \dots, n$, there exists a deterministic function $F^i_t$ such that

$$\Xi^i_{t+1} = F^i_t(\Xi^i_t, Y^i_{t+1}, Z_t, U_t). \tag{17}$$

The proof follows from the law of total probability and Bayes rule and is similar to the proof in Appendix C.

The main result of this section is the following:

Proposition 10: In the partial observation model, restricting attention to control laws of the form

$$U^i_t = g^i_t(\Xi^i_t, Z_{1:t}, U_{1:t-1}) \tag{18}$$

at all control stations $i$, $i = 1, \dots, n$, is without loss of optimality.

The intuition behind this result is as follows. Arbitrarily fix the control strategies $g^{-i}$ for all control stations other than $i$. In the full observation model, $(X^i_t, Z_{1:t}, U_{1:t-1})$ is a state sufficient for performance evaluation at control station $i$ (Lemma 4). In the partial observation model, component $X^i_t$ of this state is not observed. So, the posterior distribution $\Xi^i_t$ on $X^i_t$ given all the data available at control station $i$ should be a sufficient statistic for $X^i_t$ [38].

To show that the above intuition is true, we need to establish two conditional independence 
properties. 

Proposition 11: Proposition 2 is also true for the partial observation model for an arbitrary but fixed choice of control strategy $g$ of the form (5).

Proposition 12: In the partial observation model, the posterior probabilities $\Xi^i_t$ of the local states of all subsystems are conditionally independent given the history of shared state and control actions. Specifically, for any Borel subsets $E^i_t$ of $\Delta(\mathcal{X}^i)$, $E_t = (E^1_t, \dots, E^n_t)$, $u^i_t \in \mathcal{U}^i$, $z_t \in \mathcal{Z}$, $i = 1, \dots, n$ and $t = 1, \dots, T$, we have

$$\mathbb{P}(\Xi_{1:t} \in E_{1:t} \mid Z_{1:t} = z_{1:t}, U_{1:t} = u_{1:t}) = \prod_{i=1}^{n} \mathbb{P}(\Xi^i_{1:t} \in E^i_{1:t} \mid Z_{1:t} = z_{1:t}, U_{1:t} = u_{1:t}) \tag{19}$$

These results are proved in Appendices D and E.

An immediate consequence of Proposition 11 and Lemma 9 is the following (see Appendix F for proof).

Lemma 13: Lemma 3 is also true for the partial observation model with $R^i_t$ defined as $(\Xi^i_t, Z_{1:t}, U_{1:t-1})$.

Proof of Proposition 10: The result of Proposition 10 follows from cyclically repeating an 
argument similar to the argument after Lemma 3. ■ 

Steps 2 and 3: Sufficient statistic for common data and its simplification 

Compare Proposition 1 of the full observation model with Proposition 10 of the partial observation model. The posterior probability $\Xi^i_t$ in the latter model plays the role of the local state $X^i_t$ in the former model. This suggests that we may follow Steps 2 and 3 of the full observation model in the partial observation model by replacing $X^i_t$ by $\Xi^i_t$. Following this suggestion, define:

Definition 4: Let $\hat\Pi_t$ denote the posterior probability on $(Z_t, \Xi_t)$ given the common information $C_t$, i.e., for any $z \in \mathcal{Z}$ and any Borel subsets $E^i$ of $\Delta(\mathcal{X}^i)$ and $E = (E^1, \dots, E^n)$,

$$\hat\Pi_t(z, E) = \mathbb{P}(Z_t = z, \Xi_t \in E \mid C_t) \tag{20}$$

Definition 5: Let $\hat\Theta^i_t$, $t = 1, \dots, T$, denote the posterior probability of $\Xi^i_t$ given the common information $(Z_{1:t}, U_{1:t-1})$, i.e., for any Borel subset $E^i$ of $\Delta(\mathcal{X}^i)$,

$$\hat\Theta^i_t(E^i) = \mathbb{P}(\Xi^i_t \in E^i \mid Z_{1:t}, U_{1:t-1}).$$

Now, by following the exact same argument as in Steps 2 and 3 for the full observation model, we get that Propositions 5 and 6 and Theorems 1 and 2 are also true for the partial observation model if we replace $\Pi_t$ and $\Theta^i_t$ by $\hat\Pi_t$ and $\hat\Theta^i_t$, respectively.

V. Extension to infinite horizon 

In this section, we extend the structural result of Theorem 1 and the dynamic programming decomposition of Theorem 2 to a time-homogeneous system that runs for an infinite horizon under the discounted cost optimality criterion.

In the model of Section II-A, assume that the plant functions $f^i_t$, $i = 0, \dots, n$, and the cost function $c_t$ are time-invariant and are denoted by $f^i$ and $c$, respectively. Furthermore, in the partial observation model assume that the observation functions $\ell^i_t$, $i = 1, \dots, n$, are time-invariant and are denoted by $\ell^i$. Such a system is called a time-homogeneous system.

Assume that the system runs indefinitely. Define the performance of a control strategy $g := (g_1, g_2, \dots)$ as

$$J_\beta(g) := \lim_{T \to \infty} \mathbb{E}\Big[\sum_{t=1}^{T} \beta^{t-1} c(Z_t, X_t, U_t)\Big] \tag{21}$$

where $\beta \in (0, 1)$ is called the discount factor.

We are interested in the following optimization problem. 

Problem 2: Given a discount factor $\beta$, the distributions $P_Z$, $P_{X^i|Z}$, $P_{W^i}$, $P_{N^i}$ of the initial shared state, initial local state, plant disturbance of subsystem $i$, and observation noise of subsystem $i$ (for the partial observation model), $i = 1, \dots, n$, and the cost function $c$, find a control strategy $g$ that minimizes the expected discounted cost given by (21).

Since the sufficient statistic in Theorem 2 takes values in a time-invariant space, the results of the finite horizon system extend to infinite horizon in the usual manner. Proposition 2 remains valid for an infinite horizon system as well. Consequently, so do the structural results of Proposition 1. Therefore, we can use the approach of [26] to obtain the infinite horizon version of the dynamic program of Proposition 6. Using Lemma 8, the dynamic program simplifies as follows:

Theorem 3: There exists an optimal control strategy that is time homogeneous. An optimal choice of the partially evaluated control strategy $d$ of $g$ is given by the solution of the following fixed point equation$^3$ for the full observation model:

$$V(z, \theta) = \min_{d} \mathbb{E}\big[c(Z_t, X_t, U_t) + \beta V(Z_{t+1}, \Theta_{t+1}) \mid Z_t = z, \Theta_t = \theta, D_t = d\big] \tag{22}$$

and by replacing $\Theta^i_t$ by $\hat\Theta^i_t$ in the above equation for the partial observation model. (The above equation is time homogeneous; we are using time $t$ for ease of notation.)

VI. An example: Multiaccess broadcast 

In this section, we reconsider the multiaccess broadcast system described in Section II-B and 
show how the results of this paper provide new insights for that system. 

A. The model 

Recall that the two-user multiaccess broadcast system is a special case of the full observation model with $\mathcal{X}^i = \mathcal{U}^i = \mathcal{W}^i = \{0, 1\}$, $i = 1, 2$, and $\mathcal{Z} = \emptyset$. The state dynamics of user 1 are given by $X^1_{t+1} = \min(X^1_t - U^1_t \cdot (1 - U^2_t) + W^1_t, 1)$. The dynamics of user 2 are the symmetric dual of the above. Each user chooses a transmission decision as $U^i_t = g^i_t(X^i_{1:t}, U_{1:t-1})$ where only actions $U^i_t \le X^i_t$ are feasible. The per unit reward function is $c(x, u) = u^1 \oplus u^2$, where $\oplus$ means binary XOR. The objective is to maximize the total average reward over an infinite horizon given by

$$J(g) = \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\Big[\sum_{t=1}^{T} U^1_t \oplus U^2_t\Big] \tag{23}$$

which corresponds to maximizing the average throughput.

The case of symmetric arrivals ($p^1 = p^2$) was considered in [1], which found a lower bound on performance by finding the best window protocol strategies. An upper bound for the symmetric case was computed numerically in [2] by considering a more informative information structure. The analytic lower bounds of [1] match the numerical upper bound of [2]; hence, the strategy proposed in [1] is optimal. A dynamic programming decomposition for the general model was presented in [28].

$^3$ Due to the discounting of future costs, (22) has a fixed point that is unique.

The multiaccess broadcast system corresponds to the full observation model. Therefore, the results of this paper$^4$ provide a structure of optimal transmission policies and a dynamic programming decomposition. For the symmetric arrival case ($p^1 = p^2$), we solve the corresponding dynamic program in closed form, and give an analytic derivation of the optimal strategy.

B. Structure of optimal transmission policies and dynamic programming decomposition 

Since $Z_t = \emptyset$, the information state $\Theta_t = (\Theta^1_t, \dots, \Theta^n_t)$ of Definition 2 simplifies to $\Theta^i_t(x) = \mathbb{P}^g(X^i_t = x \mid U_{1:t-1})$. Theorem 1 implies that there is no loss of optimality in restricting attention to control strategies of the form $U^i_t = g^i_t(X^i_t, \Theta_t)$ and Theorem 2 gives the corresponding dynamic program to find the optimal transmission strategies.

To succinctly describe the dynamic program, we simplify the notation as follows: 

1) The functional map $d^i_t$ from $\mathcal{X}^i$ to $\mathcal{U}^i$ is completely specified by $d^i_t(1)$ because $d^i_t(0)$ must be zero as $u^i_t = d^i_t(x^i_t) \le x^i_t$ and $\mathcal{X}^i = \mathcal{U}^i = \{0, 1\}$. We denote $d^i_t(1)$ by $s^i_t \in \{0, 1\}$. Then,

$$u^i_t = s^i_t \cdot x^i_t.$$

2) Since $\Theta^i_t$ is a probability distribution of a binary valued random variable, it is completely specified by its component $\theta^i_t(1)$, which we denote by $q^i_t$.

To present the update of $q_t$, we define the following operators.

Definition 6: Let $A_i$, $i = 1, 2$, be an operator from $[0, 1]$ to $[0, 1]$ defined for any $q \in [0, 1]$ as $A_i q = 1 - (1 - p^i)(1 - q)$ where $p^i$ is the arrival rate at queue $i$. Then, $A_i^n q = 1 - (1 - p^i)^n (1 - q)$, and for any $q \in (0, 1)$, $A_i^n q < A_i^{n+1} q$.
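The closed form for $A_i^n q$ in Definition 6 can be checked numerically. The following is a minimal sketch using exact rational arithmetic; the function names `A` and `A_pow` and the sample values of $p$ and $q$ are illustrative choices, not part of the model.

```python
from fractions import Fraction


def A(p, q):
    """One application of the operator of Definition 6: A q = 1 - (1 - p)(1 - q)."""
    return 1 - (1 - p) * (1 - q)


def A_pow(p, q, n):
    """Apply the operator n times by direct iteration."""
    for _ in range(n):
        q = A(p, q)
    return q


p, q = Fraction(1, 3), Fraction(1, 5)   # hypothetical arrival rate and belief
closed_form_ok = all(
    A_pow(p, q, n) == 1 - (1 - p) ** n * (1 - q) for n in range(8)
)
increasing_ok = all(A_pow(p, q, n) < A_pow(p, q, n + 1) for n in range(8))
```

Here `A(p, q)` is the probability that a buffer believed to be full with probability `q` is full after one more arrival opportunity, so repeated application drives the belief monotonically towards 1.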

Lemma 7 shows that the information state $q_t = (q^1_t, q^2_t)$ updates according to a non-linear filter $F(q_t, u_t, s_t)$ where

$$F((q^1, q^2), u, s) = \begin{cases} (A_1 q^1, A_2 q^2), & \text{if } s = (0,0) \\ (p^1, A_2 q^2), & \text{if } s = (1,0) \\ (A_1 q^1, p^2), & \text{if } s = (0,1) \\ (1, 1), & \text{if } s = (1,1) \text{ and } u = (1,1) \\ (p^1, p^2), & \text{if } s = (1,1) \text{ and } u \ne (1,1). \end{cases} \tag{24}$$

$^4$ Although in Section V, we only considered the infinite horizon discounted cost criterion, the same argument also works for the average reward per unit time.

Substituting this update function in the infinite horizon average reward per unit time version of the dynamic program of Theorem 2, we get

Proposition 14: For the two-user multiaccess broadcast system, there is no loss in optimality in restricting attention to time-homogeneous transmission strategies of the form

$$U^i_t = g^i(X^i_t, Q_t) = s^i(Q_t) \cdot X^i_t.$$

An optimal strategy of such form is given by the solution of the following fixed point equation:

$$v(q^1, q^2) + J^* = \max\{v_{10}(q^1, q^2),\ v_{01}(q^1, q^2),\ v_{11}(q^1, q^2)\} \tag{25}$$

where $J^*$ denotes the average reward per unit time, $v(q^1, q^2)$ is the relative value function at $(q^1, q^2)$ and $v_{ij}(q^1, q^2)$ is the relative value-action function at $(q^1, q^2)$ when $(s^1, s^2)$ is chosen to be $(i, j)$, $i, j \in \{0, 1\}$, i.e.,

$$v_{10}(q^1, q^2) = q^1 + v(p^1, A_2 q^2),$$
$$v_{01}(q^1, q^2) = q^2 + v(A_1 q^1, p^2),$$
$$v_{11}(q^1, q^2) = q^1 + q^2 - 2 q^1 q^2 + q^1 q^2\, v(1, 1) + (1 - q^1 q^2)\, v(p^1, p^2).$$

Some remarks: 

1) We ruled out the action $(s^1, s^2) = (0, 0)$ because it is dominated by the action $(s^1, s^2) = (1, 0)$.

2) The information state $q$ takes values in the uncountable set $[0, 1]^2$. However, the form of the non-linear filter $F$ (24) implies that the reachable set of $q$ is countable and is given by

$$\mathcal{R} = \{(1, 1), (1, p^2), (p^1, 1), (p^1, p^2)\} \cup \{(p^1, A_2^n p^2) : n \in \mathbb{N}\} \cup \{(A_1^n p^1, p^2) : n \in \mathbb{N}\}$$

Thus, we need to solve the dynamic program of Proposition 14 only for $q \in \mathcal{R}$.

3) A similar dynamic programming decomposition for the two-user multiaccess broadcast
channel was derived in [28], but [28] did not completely exploit the information structure 
of the system. In particular, [28] a priori restricted attention to transmission strategies of the 
form (9) while we show that such a restriction is without loss of optimality. Furthermore, 
the dynamic program in [28] is similar to that of Proposition 6 while we use a simpler 
form of the dynamic program (Theorem 2). As shown above, the reachable set of the 
information state is countable for this simpler form of the dynamic program while such a 
simplification was not possible for the dynamic program in [28]. 
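The countability claim in the second remark can be explored with a short sketch. The code below implements the filter (24) with exact fractions and enumerates the information states reachable from $(p^1, p^2)$ in a few steps when the dominated action $(0, 0)$ is never used. It is an illustration only: the starting state, the arrival rates, the search depth and the helper names (`filter_step`, `reachable`, `in_R`) are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product


def A(p, q):
    """Operator of Definition 6."""
    return 1 - (1 - p) * (1 - q)


def filter_step(q, u, s, p):
    """Non-linear filter (24): next information state."""
    (q1, q2), (p1, p2) = q, p
    if s == (0, 0):
        return (A(p1, q1), A(p2, q2))
    if s == (1, 0):
        return (p1, A(p2, q2))
    if s == (0, 1):
        return (A(p1, q1), p2)
    # s == (1, 1): a collision reveals that both buffers were full
    return (Fraction(1), Fraction(1)) if u == (1, 1) else (p1, p2)


def reachable(p, depth):
    """Information states reachable from (p1, p2) in at most `depth` steps."""
    seen, frontier = {p}, {p}
    for _ in range(depth):
        nxt = set()
        for q in frontier:
            for s in [(1, 0), (0, 1), (1, 1)]:
                for u in product((0, 1), repeat=2):
                    if u[0] <= s[0] and u[1] <= s[1]:
                        nxt.add(filter_step(q, u, s, p))
        frontier = nxt - seen
        seen |= nxt
    return seen


def in_R(q, p, n_max):
    """Membership in the set R of the remark, truncated at n <= n_max."""
    p1, p2 = p
    one = Fraction(1)
    members = {(one, one), (one, p2), (p1, one), (p1, p2)}
    a1, a2 = p1, p2
    for _ in range(n_max):
        a1, a2 = A(p1, a1), A(p2, a2)
        members |= {(p1, a2), (a1, p2)}
    return q in members


p = (Fraction(1, 3), Fraction(1, 4))   # hypothetical arrival rates
states = reachable(p, depth=5)
```

Every state found by the search lies in the truncated version of $\mathcal{R}$, which is consistent with the remark that the dynamic program only needs to be solved on a countable set.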

C. The symmetric arrival case 

Assume that both users have symmetric arrivals, i.e., $p^1 = p^2$. Then the transformation $A_1$ is the same as $A_2$, and we denote both by $A$. Since the system is symmetric for both users, we have that for any $q^1, q^2$, the relative value functions $v(q^1, q^2)$ and $v(q^2, q^1)$ are the same. Therefore $v_{ij}(q^1, q^2) = v_{ji}(q^2, q^1)$ and consequently, the optimal coordination policy is also symmetric, i.e., $h(q^1, q^2) = h(q^2, q^1)$.

Using this symmetry, we find a closed form solution of the dynamic program. To describe the solution, we first consider the following polynomial and some of its properties. Let $\varphi_n(x) = 1 + (1 - x)^2 - (3 + x)(1 - x)^{n+1}$. Note that

1) $\varphi_n(0) = -1$ and $\varphi_n(1) = 1$. Thus, $\varphi_n$ has a root $\alpha_n$ that lies in the interval $[0, 1]$.

2) $\varphi_{n+1}(x) = (1 - x)\varphi_n(x) + x(1 + (1 - x)^2)$. Thus, $\varphi_{n+1}(\alpha_n)$ is positive. Recall that $\varphi_n(0) = -1$ for every $n$. Thus, $\alpha_{n+1}$ lies in the interval $[0, \alpha_n]$. Hence the sequence $\{\alpha_n\}$ is decreasing.

Let $\tau$ denote the root of $x = (1 - x)^2$ in $[0, 1]$. Then, $\tau \approx 0.38196 > \alpha_1 \approx 0.34727$.
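The thresholds $\alpha_n$ and $\tau$ are easy to approximate numerically. The sketch below uses plain bisection on $[0, 1]$, relying on the sign change $\varphi_n(0) = -1 < 0 < 1 = \varphi_n(1)$ noted above; the helper names `phi`, `bisect_root`, `alpha` and the tolerance are illustrative choices.

```python
def phi(n, x):
    """The polynomial phi_n(x) = 1 + (1 - x)^2 - (3 + x)(1 - x)^(n + 1)."""
    return 1 + (1 - x) ** 2 - (3 + x) * (1 - x) ** (n + 1)


def bisect_root(f, lo=0.0, hi=1.0, tol=1e-12):
    """Bisection for a root of f on [lo, hi], assuming f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def alpha(n):
    """Numerical approximation of the root alpha_n of phi_n in [0, 1]."""
    return bisect_root(lambda x: phi(n, x))


tau = bisect_root(lambda x: x - (1 - x) ** 2)
alphas = [alpha(n) for n in range(1, 8)]
```

The computed values agree with the constants quoted in the text to the digits shown there, and the list `alphas` is decreasing, as the second property predicts.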
Theorem 4: For the symmetric arrival case, $p^1 = p^2 = p$, the optimal solution $J^*$ to the dynamic program of Proposition 14 is given by

$$J^* = \begin{cases} 1 - (1 - p)^2, & \text{if } p > \alpha_1, \\ p\Big(1 - \dfrac{2p^2 - 1}{1 + p^2 + p^3}\Big), & \text{otherwise.} \end{cases} \tag{26}$$

The corresponding optimal strategy $h^*(q^1, q^2)$, $(q^1, q^2) \in \mathcal{R}$, is given by

1) For $p > \tau$,

$$h^*(q^1, q^2) = \begin{cases} (1, 0), & \text{if } q^1 > q^2, \\ (0, 1), & \text{if } q^1 < q^2, \\ (1, 0) \text{ or } (0, 1), & \text{if } q^1 = q^2. \end{cases}$$

2) For $p \le \tau$, let $n \in \mathbb{N}$ be such that $\alpha_{n+1} < p \le \alpha_n$. Then,

$$h^*(q^1, q^2) = \begin{cases} (1, 1), & \text{if } q^1 \le A^n p \text{ and } q^2 \le A^n p, \\ (1, 0), & \text{if } q^1 > \max(A^n p, q^2), \\ (0, 1), & \text{if } q^2 > \max(A^n p, q^1), \\ (1, 0) \text{ or } (0, 1), & \text{if } q^1 = q^2 = 1. \end{cases}$$

The proof is presented in Appendix G.

Although the optimal policy looks complicated with different behavior depending on the value of $p$, it has only two modes of operation. When $p > \tau$, the set of states $\{(p, Ap), (Ap, p)\}$ is absorbing and forms a recurrence class in $\mathcal{R}$. Within this recurrence class, the optimal policy is a round-robin policy. When $p < \tau$, the set of states $\{(1, 1), (p, Ap), (Ap, p), (p, p)\}$ is absorbing and forms a recurrence class in $\mathcal{R}$. Within this recurrence class, the optimal policy is identical for all $p < \tau$. The system starts with $(q^1, q^2) = (p, p)$ and chooses $(s^1, s^2) = (1, 1)$, which means that each user transmits if it has a packet. If no collision occurs, then the next state remains $(p, p)$. If a collision occurs, $(q^1, q^2) = (1, 1)$ and both users know that both of them have a packet. So, they simply empty their buffers one by one, say first $(s^1, s^2) = (1, 0)$, and then $(s^1, s^2) = (0, 1)$, and go back to the "transmit if you have a packet" action: $(s^1, s^2) = (1, 1)$. This policy is identical to the optimal window protocol proposed in [1]. Unlike [1], which showed that this strategy is the best transmission strategy when restricted to window protocols, we have shown that this strategy is the best strategy over the class of all transmission protocols.
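As a sanity check on the second branch of (26), the sketch below evaluates the long-run average throughput of the threshold strategy of Theorem 4 by iterating the distribution over information states. It is an illustration under stated assumptions: the value $p = 0.3$ with threshold index $n = 1$ is a hypothetical test point (chosen so that $\alpha_2 < p \le \alpha_1$), ties are broken in favour of user 1, and the helper names `policy`, `step` and `average_throughput` are introduced here only for the example.

```python
from fractions import Fraction


def A(p, q):
    """Operator of Definition 6 for symmetric arrival rate p."""
    return 1 - (1 - p) * (1 - q)


def policy(q, thr):
    """Threshold strategy of Theorem 4, part 2, with threshold thr = A^n p."""
    q1, q2 = q
    if q1 <= thr and q2 <= thr:
        return (1, 1)
    if q2 > max(thr, q1):
        return (0, 1)
    return (1, 0)  # q1 > max(thr, q2), or a tie broken in favour of user 1


def step(q, s, p):
    """Expected one-step reward and the list of (probability, next state)."""
    q1, q2 = q
    one = Fraction(1)
    if s == (1, 0):
        return q1, [(one, (p, A(p, q2)))]
    if s == (0, 1):
        return q2, [(one, (A(p, q1), p))]
    collide = q1 * q2
    return q1 + q2 - 2 * collide, [(collide, (one, one)), (1 - collide, (p, p))]


def average_throughput(p, n, iters=2000):
    """Average reward of the threshold strategy via power iteration."""
    thr = p
    for _ in range(n):
        thr = A(p, thr)
    dist = {(p, p): 1.0}
    for _ in range(iters):
        new = {}
        for q, mass in dist.items():
            _, moves = step(q, policy(q, thr), p)
            for prob, nxt in moves:
                new[nxt] = new.get(nxt, 0.0) + mass * float(prob)
        dist = new
    return sum(mass * float(step(q, policy(q, thr), p)[0]) for q, mass in dist.items())


p = Fraction(3, 10)
J_numeric = average_throughput(p, n=1)
J_closed = float(p * (1 - (2 * p**2 - 1) / (1 + p**2 + p**3)))
J_round_robin = float(A(p, p))
```

At this test point the numerical average matches the closed form of (26) and exceeds the round-robin throughput $Ap = 1 - (1 - p)^2$, which is what one expects when collisions are rare enough that letting both users transmit is worthwhile.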

VII. Discussion and Conclusion 

Systems with control sharing information structure arise in a variety of communication ap- 
plications. In this paper, we presented a three step approach to identify sufficient statistic and 
dynamic programming decomposition for coupled subsystems with control sharing. 

The general decentralized control system with control sharing does not admit a tractable dynamic programming decomposition. Our solution approach works because the subsystems are coupled only through control actions $U_t$ and shared state $Z_t$, but not through local states $X_t$. In particular, if the system dynamics were of the form

$$X^i_{t+1} = f^i_t(Z_t, X_t, U_t, W^i_t) \tag{27}$$

instead of (2), then Propositions 2 and 11 would fail, and consequently, Step 1 of our approach would not simplify the control strategies.

In addition, the final sufficient statistics $\Theta_t$ and $\hat\Theta_t$ derived in Step 3 are simpler than the general sufficient statistics $\Pi_t$ and $\hat\Pi_t$, which are based on [22], derived in Step 2. In particular, $\Pi_t \in \Delta(\mathcal{X}^1 \times \cdots \times \mathcal{X}^n)$, so its size increases exponentially with the number of subsystems, while $\Theta_t \in \Delta(\mathcal{X}^1) \times \cdots \times \Delta(\mathcal{X}^n)$, so its size increases linearly with the number of subsystems. This additional simplification is also a consequence of the specific form of system dynamics and would fail if the system dynamics were of the form (27).

In itself, it is not surprising that a simpler dynamical model makes the system easier to design. However, it is important to understand why this particular simplification of the dynamical model works; such an understanding will allow for similar simplifications for general non-classical information structures as well.

The system dynamics given by (2) do not remove the incentive to signal. In particular, control station $i$ at time $t + 1$ does not know all observations of control station $j$ at time $t$. Hence, control station $j$ has an incentive to signal its local observation to control station $i$ through its action $U^j_t$. Thus, the model is not partially nested [14] (or quasi-classical [39]). Even after taking the conditional independence results of Propositions 2 and 11 into account, the signaling incentive is still present due to the cost coupling. Knowing the local state $X^j_t$ of subsystem $j$ will help control station $i$ to improve its choice of action $U^i_t$ in order to minimize the expected cost to go $\mathbb{E}[\sum_{s=t}^{T} c_s(Z_s, X_s, U_s)]$. Thus, the model is not stochastically nested [16] (or P-quasi-classical [39]).

We may think of the system dynamics of the form (2) as a sufficient condition to obtain a time-invariant sufficient statistic for the local information at each control station (Propositions 1 and 10). Once such a sufficient statistic is identified, the model reduces to a partial history sharing information structure with local information taking values in a time-invariant space. Thereafter, one can use the results of [22] to obtain a sufficient statistic of the common information at all control stations.

Finding such sufficient conditions (to extend the applicability of a specific solution technique to 
more general models) is a recurring theme in decentralized control. A similar approach has been 
used in [10] to generalize the solution approach of [8] to two controller teams where at least one 
controller has finite memory; in [16] to generalize the solution approach of [14] to stochastically
nested information structures; in [40] to generalize the solution approach of [28] to broadcast
information structures; and in [39] to generalize the solution of classical and quasiclassical 
information structures to P-classical and P-quasiclassical information structures. The model 
and results of this paper present such a sufficient condition to extend the results of [22] to 
control sharing information structure.

Appendix A
Proof of Proposition 2 

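For simplicity of notation, we use $\mathbb{P}(z_{1:t}, x_{1:t}, u_{1:t})$ to denote $\mathbb{P}(Z_{1:t} = z_{1:t}, X_{1:t} = x_{1:t}, U_{1:t} = u_{1:t})$ and a similar notation for conditional probability. Define:

• $\alpha^i_t := \mathbb{P}(u^i_t \mid z_{1:t}, x^i_{1:t}, u_{1:t-1})$, $\beta^i_t := \mathbb{P}(x^i_t \mid z_{t-1}, x^i_{t-1}, u_{t-1})$, $\gamma_t := \mathbb{P}(z_t \mid z_{t-1}, u_{t-1})$; and

• $A^i_t := \prod_{s=1}^{t} \alpha^i_s$, $B^i_t := \prod_{s=1}^{t} \beta^i_s$, $\Gamma_t := \prod_{s=1}^{t} \gamma_s$.

From the law of total probability it follows that $\mathbb{P}(z_{1:t}, x_{1:t}, u_{1:t}) = \big(\prod_{i=1}^{n} A^i_t B^i_t\big) \Gamma_t$. Summing over all realizations of $x_{1:t}$ and observing that $A^i_t$ and $B^i_t$ depend only on $(z_{1:t}, x^i_{1:t}, u_{1:t})$, we get

$$\mathbb{P}(z_{1:t}, u_{1:t}) = \sum_{x^1_{1:t}} \sum_{x^2_{1:t}} \cdots \sum_{x^n_{1:t}} \Big(\prod_{i=1}^{n} A^i_t B^i_t\Big) \Gamma_t = \Big(\prod_{i=1}^{n} \Big(\sum_{x^i_{1:t}} A^i_t B^i_t\Big)\Big) \Gamma_t.$$

Thus, using Bayes rule we get

$$\mathbb{P}(x_{1:t} \mid z_{1:t}, u_{1:t}) = \prod_{i=1}^{n} \frac{A^i_t B^i_t}{\sum_{x^i_{1:t}} A^i_t B^i_t} \tag{28}$$

Summing both sides over $x^i_{1:t}$, $i \ne j$, we get

$$\mathbb{P}(x^j_{1:t} \mid z_{1:t}, u_{1:t}) = \frac{A^j_t B^j_t}{\sum_{x^j_{1:t}} A^j_t B^j_t} \tag{29}$$

The result follows from combining (28) and (29).

Appendix B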
Proof of Lemma 3 

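For ease of notation, we use $\mathbb{P}(\tilde r^i_{t+1} \mid r^i_{1:t}, u^i_{1:t})$ to denote $\mathbb{P}(R^i_{t+1} = \tilde r^i_{t+1} \mid R^i_{1:t} = r^i_{1:t}, U^i_{1:t} = u^i_{1:t})$ and a similar notation for other probability statements. Consider

$$\mathbb{P}(\tilde r^i_{t+1} \mid r^i_{1:t}, u^i_{1:t}) = \mathbb{P}(\tilde x^i_{t+1} \mid x^i_t, z_t, \tilde u_t) \cdot \mathbb{P}(\tilde z_{t+1} \mid z_t, \tilde u_t) \cdot \mathbb{1}[\tilde u_{1:t-1} = u_{1:t-1}] \cdot \mathbb{1}[\tilde u^i_t = u^i_t] \cdot \mathbb{1}[\tilde z_{1:t} = z_{1:t}] \cdot \mathbb{P}(\tilde u^{-i}_t \mid x^i_{1:t}, z_{1:t}, u_{1:t-1}, u^i_t) \tag{30}$$

Simplify the last term of (30) as follows:

$$\begin{aligned} \mathbb{P}(\tilde u^{-i}_t \mid x^i_{1:t}, z_{1:t}, u_{1:t-1}, u^i_t) &\overset{(a)}{=} \mathbb{P}(\tilde u^{-i}_t \mid x^i_{1:t}, z_{1:t}, u_{1:t-1}) \\ &= \sum_{x^{-i}_{1:t}} \mathbb{P}(\tilde u^{-i}_t \mid x^{-i}_{1:t}, z_{1:t}, u_{1:t-1}) \cdot \mathbb{P}(x^{-i}_{1:t} \mid x^i_{1:t}, z_{1:t}, u_{1:t-1}) \\ &\overset{(b)}{=} \sum_{x^{-i}_{1:t}} \mathbb{P}(\tilde u^{-i}_t \mid x^{-i}_{1:t}, z_{1:t}, u_{1:t-1}) \cdot \mathbb{P}(x^{-i}_{1:t} \mid z_{1:t}, u_{1:t-1}) = \mathbb{P}(\tilde u^{-i}_t \mid z_{1:t}, u_{1:t-1}) \end{aligned} \tag{31}$$

where (a) is true because $u^i_t$ is determined by $x^i_{1:t}$, $z_{1:t}$ and $u_{1:t-1}$, and (b) follows from Proposition 2. Substituting (31) in (30), we get

$$\begin{aligned} \mathbb{P}(\tilde r^i_{t+1} \mid r^i_{1:t}, u^i_{1:t}) &= \mathbb{P}(\tilde x^i_{t+1} \mid x^i_t, z_t, \tilde u_t) \cdot \mathbb{P}(\tilde z_{t+1} \mid z_t, \tilde u_t) \cdot \mathbb{1}[\tilde u_{1:t-1} = u_{1:t-1}] \cdot \mathbb{1}[\tilde z_{1:t} = z_{1:t}] \cdot \mathbb{1}[\tilde u^i_t = u^i_t] \cdot \mathbb{P}(\tilde u^{-i}_t \mid z_{1:t}, u_{1:t-1}) \\ &= \mathbb{P}(\tilde x^i_{t+1}, \tilde z_{1:t+1}, \tilde u_{1:t} \mid x^i_t, u^i_t, z_{1:t}, u_{1:t-1}) = \mathbb{P}(\tilde r^i_{t+1} \mid r^i_t, u^i_t) \end{aligned} \tag{32}$$

This completes the proof of part 1) of the Lemma.

To prove part 2), it is sufficient to show that $\mathbb{P}(\tilde z_t, \tilde x_t, \tilde u_t \mid r^i_{1:t}, u^i_{1:t}) = \mathbb{P}(\tilde z_t, \tilde x_t, \tilde u_t \mid r^i_t, u^i_t)$. Consider

$$\begin{aligned} \mathbb{P}(\tilde z_t, \tilde x_t, \tilde u_t \mid r^i_{1:t}, u^i_{1:t}) &= \mathbb{1}[(\tilde z_t, \tilde x^i_t, \tilde u^i_t) = (z_t, x^i_t, u^i_t)] \cdot \mathbb{P}(\tilde x^{-i}_t, \tilde u^{-i}_t \mid x^i_{1:t}, u^i_t, z_{1:t}, u_{1:t-1}) \\ &\overset{(c)}{=} \mathbb{1}[(\tilde z_t, \tilde x^i_t, \tilde u^i_t) = (z_t, x^i_t, u^i_t)] \cdot \mathbb{P}(\tilde x^{-i}_t, \tilde u^{-i}_t \mid z_{1:t}, u_{1:t-1}) \\ &= \mathbb{P}(\tilde z_t, \tilde x_t, \tilde u_t \mid r^i_t, u^i_t) \end{aligned} \tag{33}$$

where (c) follows from an argument similar to (31).$^5$ This completes the proof of part 2) of the Lemma.

$^5$ Recall that $x^{-i}_t$ denotes the vector $(x^1_t, \dots, x^{i-1}_t, x^{i+1}_t, \dots, x^n_t)$.

Appendix C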
Proof of Lemma 7 

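Consider the system for a particular realization $(z_{1:T}, x_{1:T}, u_{1:T}, d_{1:T})$ of $(Z_{1:T}, X_{1:T}, U_{1:T}, D_{1:T})$. For ease of notation, we use $\mathbb{P}(x^i_{t+1} \mid z_{1:t+1}, u_{1:t}, d_{1:t})$ to denote $\mathbb{P}(X^i_{t+1} = x^i_{t+1} \mid Z_{1:t+1} = z_{1:t+1}, U_{1:t} = u_{1:t}, D_{1:t} = d_{1:t})$. Define

$$A^i(x^i_{t+1}, x_t, z_{1:t+1}, u_{1:t}, d_{1:t}) := \mathbb{P}(x^i_{t+1}, x_t, z_{t+1}, u_t \mid z_{1:t}, u_{1:t-1}, d_{1:t});$$

$$B^i(x^i_{t+1}, x_t, z_{t+1}, z_t, d_t, \theta_t) := \mathbb{P}(x^i_{t+1} \mid x^i_t, z_t, u_t) \cdot \mathbb{P}(z_{t+1} \mid z_t, u_t) \cdot \prod_{j=1}^{n} \theta^j_t(x^j_t).$$

The system dynamics and Proposition 2 imply that

$$A^i(x^i_{t+1}, x_t, z_{1:t+1}, u_{1:t}, d_{1:t}) = B^i(x^i_{t+1}, x_t, z_{t+1}, z_t, d_t, \theta_t)\, \mathbb{1}[u_t = d_t(x_t)] \tag{34}$$

Consider component $i$ of the realization $\theta_{t+1}$ of $\Theta_{t+1}$.

$$\theta^i_{t+1}(x^i_{t+1}) = \mathbb{P}(x^i_{t+1} \mid z_{1:t+1}, u_{1:t}, d_{1:t}) \overset{(a)}{=} \frac{\sum_{\{x_t : d_t(x_t) = u_t\}} B^i(x^i_{t+1}, x_t, z_{t+1}, z_t, d_t, \theta_t)}{\sum_{\tilde x^i_{t+1}} \sum_{\{x_t : d_t(x_t) = u_t\}} B^i(\tilde x^i_{t+1}, x_t, z_{t+1}, z_t, d_t, \theta_t)} =: F^i_t(\theta_t, z_{t+1}, u_t, d_t)(x^i_{t+1}) \tag{35}$$

where (a) follows from Bayes rule and (34). Combining (35) for all $i$, $i = 1, \dots, n$, proves the Lemma.

Appendix D
Proof of Proposition 11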

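The proof is similar to the proof of Proposition 2. As before, for ease of notation, we use $\mathbb{P}(z_{1:t}, x_{1:t}, y_{1:t}, u_{1:t})$ to denote $\mathbb{P}(Z_{1:t} = z_{1:t}, X_{1:t} = x_{1:t}, Y_{1:t} = y_{1:t}, U_{1:t} = u_{1:t})$. Define

• $\alpha^i_t := \mathbb{P}(u^i_t \mid z_{1:t}, y^i_{1:t}, u_{1:t-1})$, $\beta^i_t := \mathbb{P}(x^i_t \mid z_{t-1}, x^i_{t-1}, u_{t-1})$, $\gamma_t := \mathbb{P}(z_t \mid z_{t-1}, u_{t-1})$, $\delta^i_t := \mathbb{P}(y^i_t \mid x^i_t)$; and

• $A^i_t := \prod_{s=1}^{t} \alpha^i_s$, $B^i_t := \prod_{s=1}^{t} \beta^i_s$, $\Gamma_t := \prod_{s=1}^{t} \gamma_s$, $\Delta^i_t := \prod_{s=1}^{t} \delta^i_s$.

From the law of total probability, it follows that $\mathbb{P}(z_{1:t}, x_{1:t}, y_{1:t}, u_{1:t}) = \big(\prod_{i=1}^{n} A^i_t B^i_t \Delta^i_t\big) \Gamma_t$. Sum over the realizations of $y_{1:t}$ and observe that $A^i_t$ and $\Delta^i_t$ depend on $y_{1:t}$ only through $y^i_{1:t}$. This gives,

$$\mathbb{P}(z_{1:t}, x_{1:t}, u_{1:t}) = \sum_{y^1_{1:t}} \sum_{y^2_{1:t}} \cdots \sum_{y^n_{1:t}} \Big(\prod_{i=1}^{n} A^i_t B^i_t \Delta^i_t\Big) \Gamma_t = \Big(\prod_{i=1}^{n} \Big(\sum_{y^i_{1:t}} A^i_t \Delta^i_t\Big) B^i_t\Big) \Gamma_t.$$

Now sum over $x_{1:t}$ and observe that $B^i_t$ and $\Delta^i_t$ depend on $x_{1:t}$ only through $x^i_{1:t}$.

$$\mathbb{P}(z_{1:t}, u_{1:t}) = \sum_{x^1_{1:t}} \sum_{x^2_{1:t}} \cdots \sum_{x^n_{1:t}} \Big(\prod_{i=1}^{n} \Big(\sum_{y^i_{1:t}} A^i_t \Delta^i_t\Big) B^i_t\Big) \Gamma_t = \Big(\prod_{i=1}^{n} \Big(\sum_{x^i_{1:t}} B^i_t \Big(\sum_{y^i_{1:t}} A^i_t \Delta^i_t\Big)\Big)\Big) \Gamma_t.$$

Thus, by Bayes rule, we get

$$\mathbb{P}(x_{1:t} \mid z_{1:t}, u_{1:t}) = \prod_{i=1}^{n} \frac{B^i_t \big(\sum_{y^i_{1:t}} A^i_t \Delta^i_t\big)}{\sum_{x^i_{1:t}} B^i_t \big(\sum_{y^i_{1:t}} A^i_t \Delta^i_t\big)} \tag{36}$$

Summing both sides over $x^i_{1:t}$, $i \ne j$, we get

$$\mathbb{P}(x^j_{1:t} \mid z_{1:t}, u_{1:t}) = \frac{B^j_t \big(\sum_{y^j_{1:t}} A^j_t \Delta^j_t\big)}{\sum_{x^j_{1:t}} B^j_t \big(\sum_{y^j_{1:t}} A^j_t \Delta^j_t\big)} \tag{37}$$

The result follows from combining (36) and (37).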

Appendix E 
Proof of Proposition 12 

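Consider

$$\mathbb{P}(\Xi_{1:t} \in E_{1:t} \mid z_{1:t}, u_{1:t}) = \int_{E_{1:t}} d\mathbb{P}(\xi_{1:t} \mid z_{1:t}, u_{1:t})$$

From Proposition 11 and the law of total probability, we get

$$\begin{aligned} d\mathbb{P}(\xi_{1:t} \mid z_{1:t}, u_{1:t}) &= \sum_{x_{1:t}, y_{1:t}} \Big(\prod_{i=1}^{n} d\mathbb{P}(\xi^i_{1:t} \mid y^i_{1:t}, z_{1:t}, u_{1:t}) \cdot \mathbb{P}(y^i_{1:t} \mid x^i_{1:t}) \cdot \mathbb{P}(x^i_{1:t} \mid z_{1:t}, u_{1:t})\Big) \\ &= \prod_{i=1}^{n} \Big(\sum_{x^i_{1:t}, y^i_{1:t}} d\mathbb{P}(\xi^i_{1:t} \mid y^i_{1:t}, z_{1:t}, u_{1:t}) \cdot \mathbb{P}(y^i_{1:t} \mid x^i_{1:t}) \cdot \mathbb{P}(x^i_{1:t} \mid z_{1:t}, u_{1:t})\Big) \end{aligned}$$

which completes the proof of the Proposition.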

Appendix F 
Proof of Lemma 13 

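For ease of notation, we use $d\mathbb{P}(\tilde r^i_{t+1} \mid r^i_{1:t}, u^i_{1:t})$ to denote $d\mathbb{P}(R^i_{t+1} = \tilde r^i_{t+1} \mid R^i_{1:t} = r^i_{1:t}, U^i_{1:t} = u^i_{1:t})$ and a similar notation for other probability measures. Consider

$$\begin{aligned} d\mathbb{P}(\tilde r^i_{t+1} \mid r^i_{1:t}, u^i_{1:t}) = \sum_{x^i_{t:t+1},\, y^i_{t+1}} &\mathbb{1}[\tilde \xi^i_{t+1} = F^i_t(\xi^i_t, y^i_{t+1}, z_t, \tilde u_t)] \cdot \mathbb{P}(y^i_{t+1} \mid x^i_{t+1}) \cdot \mathbb{P}(x^i_{t+1} \mid x^i_t, z_t, \tilde u_t) \\ &\cdot \mathbb{P}(\tilde z_{t+1} \mid z_t, \tilde u_t) \cdot \mathbb{1}[\tilde z_{1:t} = z_{1:t}] \cdot \mathbb{1}[\tilde u^i_t = u^i_t] \cdot \mathbb{1}[\tilde u_{1:t-1} = u_{1:t-1}] \\ &\cdot \xi^i_t(x^i_t) \cdot \mathbb{P}(\tilde u^{-i}_t \mid \xi^i_{1:t}, z_{1:t}, u_{1:t-1}, u^i_t) \end{aligned} \tag{38}$$

Simplify the last term of (38) as follows:

$$\begin{aligned} \mathbb{P}(\tilde u^{-i}_t \mid \xi^i_{1:t}, z_{1:t}, u_{1:t-1}, u^i_t) &= \sum_{x^{-i}_{1:t},\, y^{-i}_{1:t}} \mathbb{P}(\tilde u^{-i}_t \mid y^{-i}_{1:t}, z_{1:t}, u_{1:t-1}) \cdot \mathbb{P}(y^{-i}_{1:t} \mid x^{-i}_{1:t}) \cdot \mathbb{P}(x^{-i}_{1:t} \mid z_{1:t}, u_{1:t-1}) \\ &= \mathbb{P}(\tilde u^{-i}_t \mid z_{1:t}, u_{1:t-1}) \end{aligned} \tag{39}$$

Substituting (39) in (38) and simplifying, we get part 1) of the Lemma:

$$d\mathbb{P}(\tilde r^i_{t+1} \mid r^i_{1:t}, u^i_{1:t}) = d\mathbb{P}(\tilde r^i_{t+1} \mid r^i_t, u^i_t) \tag{40}$$

The proof of part 2) is similar to (33).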

Appendix G 
Proof of Theorem 4 

We introduce a short hand notation that exploits the symmetry of the problem and the fact that the reachable set $\mathcal{R}$ is countable. Define

$$\begin{array}{llll} v^* = v(1, 1), & v^0 = v(p, p), & v^n = v(p, A^n p),\ n \in \mathbb{N}, & v^\infty = v(p, 1), \\ a^* = v_{10}(1, 1), & a^0 = v_{10}(p, p), & a^n = v_{10}(p, A^n p),\ n \in \mathbb{N}, & a^\infty = v_{10}(p, 1), \\ b^* = v_{01}(1, 1), & b^0 = v_{01}(p, p), & b^n = v_{01}(p, A^n p),\ n \in \mathbb{N}, & b^\infty = v_{01}(p, 1), \\ c^* = v_{11}(1, 1), & c^0 = v_{11}(p, p), & c^n = v_{11}(p, A^n p),\ n \in \mathbb{N}, & c^\infty = v_{11}(p, 1). \end{array}$$

Notice that $v_{10}(A^n p, p) = v_{01}(p, A^n p) = b^n$ and $v_{01}(A^n p, p) = v_{10}(p, A^n p) = a^n$.

With the above notation, the dynamic program of Proposition 14 can be written as

$$v^n + J^* = \max\{a^n, b^n, c^n\}, \quad n \in \{*, 0, 1, 2, \dots, \infty\} \tag{41}$$

where

$$a^* = 1 + v^\infty, \quad b^* = 1 + v^\infty, \quad c^* = v^* \tag{42a}$$

$$a^0 = p + v^1, \quad b^0 = p + v^1, \quad c^0 = 2p(1 - p) + p^2 v^* + (1 - p^2) v^0 \tag{42b}$$

$$a^n = p + v^{n+1}, \quad b^n = A^n p + v^1, \quad c^n = p + A^n p - 2p \cdot A^n p + p \cdot A^n p \cdot v^* + (1 - p \cdot A^n p) v^0 \tag{42c}$$

$$a^\infty = p + v^\infty, \quad b^\infty = 1 + v^1, \quad c^\infty = 1 - p + p v^* + (1 - p) v^0 \tag{42d}$$

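Lemma 15: A solution of the fixed point equations of (42) is given by the following:

1) For $p \in (\tau, 1]$

$$J^* = Ap, \quad v^* = 2 - Ap, \quad v^0 = p, \quad v^n = A^n p,\ n \in \mathbb{N}, \quad v^\infty = 1. \tag{43}$$

2) For $p \in (\alpha_1, \tau]$

$$J^* = Ap, \quad v^* = 2 - Ap, \quad v^0 = 1 - Ap, \quad v^n = A^n p,\ n \in \mathbb{N}, \quad v^\infty = 1. \tag{44}$$

3) For $p \in (\alpha_{m+1}, \alpha_m]$, $m \in \mathbb{N}$, define $\zeta(x) = 1 + x^2 + x^3$. Then,

$$J^* = p\Big(1 - \frac{\varphi_0(p)}{\zeta(p)}\Big), \quad v^* = 2 - J^*, \quad v^0 = 2 - p - \frac{1 + (1 - p)^2}{\zeta(p)}, \quad v^\infty = 1, \tag{45a}$$

and

$$v^n = \begin{cases} w_n, & \text{if } n \le m, \\ A^n p, & \text{if } n > m; \end{cases} \quad n \in \mathbb{N}, \tag{45b}$$

where

$$w_n = c^n - J^* = p + A^n p - p \cdot A^n p \cdot J^* + (1 - p \cdot A^n p)\, v^0 - J^*.$$

Note that $w_1 = J^*$.

The proof follows from elementary algebra. For completeness, we include the details for each case below.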

Case 1: $p \in (\tau, 1]$

We show that the values of Lemma 15 satisfy the dynamic program of (41) and (42), by considering the four cases separately.

1) $a^* = b^* = 2$, $c^* = 2 - Ap$. Hence, either action $(0, 1)$ or $(1, 0)$ is optimal at state $(1, 1)$ and $v^* + J^* = a^* = b^* = 2$.

2) $a^0 = b^0 = p + Ap$, and $b^0 - c^0 = p^2(p - (1 - p)^2)$ which is positive for $p > \tau$. Recall that $\tau$ is the root of $x = (1 - x)^2$. Hence, either action $(0, 1)$ or $(1, 0)$ is optimal at state $(p, p)$ and $v^0 + J^* = a^0 = b^0 = p + Ap$.

3) Consider $n \in \mathbb{N}$. $a^n = p + A^{n+1} p$ and $b^n = A^n p + Ap$. Thus, $b^n - a^n = p(1 - p) \cdot A^{n-1} p > 0$. Moreover, $b^n - c^n = p^2[(3 - p) A^n p - 1] > p^2[(3 - p)p - 1]$ which is positive for $p \in (\tau, 1]$. Thus, the action $(0, 1)$ is optimal at state $(p, A^n p)$ (and by symmetry, the action $(1, 0)$ is optimal at state $(A^n p, p)$) for $n \in \mathbb{N}$ and $v^n + J^* = b^n = A^n p + Ap$.

4) $a^\infty = 1 + p$, $b^\infty = 1 + Ap$. Thus, $b^\infty > a^\infty$. Moreover, $b^\infty - c^\infty = p \cdot Ap > 0$. Thus, the action $(0, 1)$ is optimal at state $(p, 1)$ (and by symmetry, the action $(1, 0)$ is optimal at state $(1, p)$), and $v^\infty + J^* = b^\infty = 1 + Ap$.
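The algebra of Case 1 can be cross-checked mechanically. The sketch below plugs the values (43) into the fixed point equations (41) and (42) using exact fractions and reports whether every equation holds for the first `n_max` states $(p, A^n p)$. It is a hypothetical helper written for this check: the name `satisfies_case1`, the truncation at `n_max` and the test values of $p$ are assumptions, and the expressions for $a$, $b$, $c$ are the ones obtained by evaluating Proposition 14 at the states $(1,1)$, $(p,p)$, $(p, A^n p)$ and $(p, 1)$.

```python
from fractions import Fraction


def A_pow(p, n):
    """A^n p = 1 - (1 - p)^(n + 1), from Definition 6 with q = p."""
    return 1 - (1 - p) ** (n + 1)


def satisfies_case1(p, n_max=30):
    """Check (41)-(42) with the candidate solution (43)."""
    J = A_pow(p, 1)                        # J* = Ap
    v_star, v0, v_inf = 2 - J, p, Fraction(1)

    def v(n):                              # v^n = A^n p
        return A_pow(p, n)

    checks = []
    # state (1, 1), equation (42a)
    checks.append(v_star + J == max(1 + v_inf, 1 + v_inf, v_star))
    # state (p, p), equation (42b)
    c0 = 2 * p * (1 - p) + p**2 * v_star + (1 - p**2) * v0
    checks.append(v0 + J == max(p + v(1), p + v(1), c0))
    # states (p, A^n p), equation (42c)
    for n in range(1, n_max + 1):
        q = A_pow(p, n)
        a, b = p + v(n + 1), q + v(1)
        c = p + q - 2 * p * q + p * q * v_star + (1 - p * q) * v0
        checks.append(v(n) + J == max(a, b, c))
    # state (p, 1), equation (42d)
    c_inf = 1 - p + p * v_star + (1 - p) * v0
    checks.append(v_inf + J == max(p + v_inf, 1 + v(1), c_inf))
    return all(checks)
```

For instance, `satisfies_case1(Fraction(1, 2))` holds, while `satisfies_case1(Fraction(3, 10))` does not, because for $p < \tau$ the action $(1, 1)$ becomes strictly better at state $(p, p)$.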

Case 2: $p \in (\alpha_1, \tau]$

We show that the values of Lemma 15 satisfy the dynamic program of (41) and (42), by considering the four cases separately.

1) $a^* = b^* = 2$, $c^* = 2 - Ap$. Hence, either action $(0, 1)$ or $(1, 0)$ is optimal at state $(1, 1)$ and $v^* + J^* = a^* = b^* = 2$.

2) $a^0 = b^0 = p + Ap$ and $c^0 = 1$. Thus, $c^0 - a^0 = (1 - p)^2 - p \ge 0$ for $p \in (\alpha_1, \tau]$. Recall that $\tau$ is the root of $x = (1 - x)^2$. Hence, action $(1, 1)$ is optimal at state $(p, p)$ and $v^0 + J^* = c^0 = 1$.

3) Consider $n \in \mathbb{N}$. $a^n = p + A^{n+1} p$ and $b^n = A^n p + Ap$. Thus, $b^n - a^n = p(1 - p) \cdot A^{n-1} p > 0$. Moreover, $c^n = (1 - p) A^n p + p - Ap + 1$ and, thus, $b^n - c^n = 2Ap + p \cdot A^n p - p - 1 \ge 2Ap + p \cdot Ap - p - 1 = \varphi_1(p)$, which is positive for $p \in (\alpha_1, \tau]$. Thus, the action $(0, 1)$ is optimal at state $(p, A^n p)$ (and by symmetry, the action $(1, 0)$ is optimal at state $(A^n p, p)$) for $n \in \mathbb{N}$ and $v^n + J^* = b^n = A^n p + Ap$.

4) $a^\infty = 1 + p$, $b^\infty = 1 + Ap$. Thus, $b^\infty > a^\infty$. Furthermore, $c^\infty = 2 - Ap$ and $b^\infty - c^\infty = 2Ap - 1 = -\varphi_0(1 - p)$, which is positive for $p > 1 - \alpha_0$. Since $\alpha_1 > 1 - \alpha_0$, we have that for $p \in (\alpha_1, \tau]$, the action $(0, 1)$ is optimal at state $(p, 1)$ (and by symmetry, the action $(1, 0)$ is optimal at state $(1, p)$) and $v^\infty + J^* = b^\infty = 1 + Ap$.

Case 3: $p \in (\alpha_{m+1}, \alpha_m]$, $m \in \mathbb{N}$

We show that the values of Lemma 15 satisfy the dynamic program of (41) and (42), by considering each case separately. Recall $\zeta(x) = 1 + x^2 + x^3$.

1) $a^* = b^* = 2$ and $c^* = 2 - J^*$. Hence, either action $(0, 1)$ or $(1, 0)$ is optimal at state $(1, 1)$ and $v^* + J^* = a^* = b^* = 2$.

2) $a^0 = b^0 = p + w_1$ and $c^0 = 2p - p^2 J^* + (1 - p^2) v^0$. Hence, $c^0 - a^0 = -p^2 \varphi_0(p)/\zeta(p)$ which is positive for $p < \alpha_0$. Thus, for $p \in (\alpha_{m+1}, \alpha_m]$, the action $(1, 1)$ is optimal at state $(p, p)$ and $v^0 + J^* = c^0$.

3) Consider $n < m$. Then, $c^n - a^n = -p \cdot A^n p \cdot \varphi_0(p)/\zeta(p)$, which is positive for $p \in [0, \alpha_0]$ and thus for $p \in (\alpha_{m+1}, \alpha_m]$. Moreover, $c^n - b^n = -p^2 \varphi_n(p)/\zeta(p)$ which is positive for $p \in (0, \alpha_n]$ and hence for $p \in (\alpha_{m+1}, \alpha_m]$. (Recall that $\alpha_n$ forms a decreasing sequence.) Thus, the action $(1, 1)$ is optimal at state $(p, A^n p)$ for $n < m$ (and by symmetry for state $(A^n p, p)$ for $n < m$) and $v^n + J^* = c^n$.

4) Consider $n = m$. Then, $a^m = p + A^{m+1} p$ and $c^m - a^m = p\big[(1 - p)^{m+1} \varphi_1(p) + (1 - 2p - p^3)\big]/\zeta(p)$. For $p \in [0, \alpha_m]$ we have $\varphi_1(p) \le 0$, so the first term in the brackets is at least $\varphi_1(p)$; since the second term is larger than $-\varphi_1(p)$, the sum is positive in that interval. Moreover $c^m - b^m = -p^2 \varphi_m(p)/\zeta(p)$ which is non-negative for $p \in [0, \alpha_m]$. Thus, both differences are non-negative for $p \in (\alpha_{m+1}, \alpha_m]$. Hence, the action $(1, 1)$ is optimal at the state $(p, A^m p)$ (and by symmetry for the state $(A^m p, p)$) and $v^m + J^* = c^m$.

5) Consider $n > m$. Then, $a^n = p + A^{n+1} p$ and $b^n = A^n p + J^*$. Thus, $b^n - a^n = -p \varphi_0(p)/\zeta(p) - p(1 - p)^{n+1} > -p \varphi_0(p)/\zeta(p) - p(1 - p) = -p^2 \varphi_1(p)/\zeta(p)$ which is non-negative for $p \in [0, \alpha_1]$. Moreover, $b^n - c^n = p^2 \varphi_n(p)/\zeta(p)$ which is positive for $p \in (\alpha_n, 1]$ and hence for $p \in (\alpha_{m+1}, \alpha_m] \subset (\alpha_n, 1]$. (Recall that $\alpha_n$ forms a decreasing sequence.) Thus, the action $(0, 1)$ is optimal at state $(p, A^n p)$, for $n > m$ (and by symmetry the action $(1, 0)$ is optimal at state $(A^n p, p)$ for $n > m$) and $v^n + J^* = b^n$.

6) $a^\infty = 1 + p$ and $b^\infty = 1 + w_1$. Thus, $b^\infty - a^\infty = -p \varphi_0(p)/\zeta(p)$, which is positive for $p < \alpha_0$. Moreover, $c^\infty = 1 + p - p J^* + (1 - p) v^0$ and, thus, $b^\infty - c^\infty = p^2 (1 + (1 - p)^2)/\zeta(p)$ which is always positive. Thus, for $p \in (\alpha_{m+1}, \alpha_m]$, the action $(0, 1)$ is optimal at state $(p, 1)$ (and by symmetry the action $(1, 0)$ is optimal at state $(1, p)$) and $v^\infty + J^* = b^\infty = 1 + J^*$.